Modified character-level deciphering algorithm for OCR in degraded documents
Abstract
Modifications to a previous character-level deciphering algorithm for OCR are presented in this paper that are able to handle touching characters and are tolerant to mistakes made at the clustering stage. The objective of a character-level deciphering algorithm is to assign alphabetic identities to character patterns such that the character repetition pattern in an input text matches the letter repetition pattern provided by a language model. Degradation in document images usually causes touching characters and mistakes in clustering the character patterns, both of which pose difficulties for character-level deciphering algorithms. The modifications proposed in this paper tightly integrate visual constraints from characters and touching patterns with constraints from a language model. This solves the problem of touching characters and reverses clustering mistakes. This provides a deciphering algorithm with robust performance under image degradation.
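The basic deciphering idea described in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm. It assumes perfect clustering (each cluster id stands for exactly one letter), no touching characters, and a small word list standing in for the language model. The names `repetition_pattern`, `decipher`, `words`, and `lexicon` are hypothetical and chosen only for this illustration.

```python
from typing import Dict, Hashable, List, Optional, Sequence, Tuple


def repetition_pattern(symbols: Sequence[Hashable]) -> Tuple[int, ...]:
    """Canonical repetition pattern: first distinct symbol -> 0, next -> 1, ..."""
    first_seen: Dict[Hashable, int] = {}
    return tuple(first_seen.setdefault(s, len(first_seen)) for s in symbols)


def decipher(
    words: List[Sequence[int]],
    lexicon: List[str],
    mapping: Optional[Dict[int, str]] = None,
) -> Optional[Dict[int, str]]:
    """Backtracking search for a one-to-one cluster-id -> letter assignment.

    Each input word is a sequence of cluster ids. A lexicon word is a
    candidate only if its letter repetition pattern equals the cluster
    repetition pattern (assumption: clustering is error-free).
    """
    mapping = dict(mapping or {})
    if not words:
        return mapping
    first, rest = words[0], words[1:]
    for candidate in lexicon:
        if repetition_pattern(candidate) != repetition_pattern(first):
            continue
        trial = dict(mapping)
        used = set(trial.values())
        consistent = True
        for cluster_id, letter in zip(first, candidate):
            if cluster_id in trial:
                if trial[cluster_id] != letter:
                    consistent = False
                    break
            elif letter in used:
                # Another cluster already owns this letter.
                consistent = False
                break
            else:
                trial[cluster_id] = letter
                used.add(letter)
        if consistent:
            result = decipher(rest, lexicon, trial)
            if result is not None:
                return result
    return None


# Three word images, each given as a sequence of cluster ids.
words = [[1, 2, 3, 1], [4, 3, 1], [2, 3, 1]]
lexicon = ["that", "cat", "hat", "the", "tot"]
mapping = decipher(words, lexicon)
decoded = " ".join("".join(mapping[c] for c in w) for w in words)
print(decoded)
```

A single clustering error or one merged pair of touching characters changes a word's repetition pattern and makes this strict matching fail, which is the weakness the modifications described in the abstract are meant to address.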
Similar resources
Performance Evaluation of Two Arabic OCR Products
Numerous Optical Character Recognition (OCR) companies claim that their products have near-perfect recognition accuracy (close to 99.9%). In practice, however, these accuracy rates are rarely achieved. Most systems break down when the input document images are highly degraded, such as scanned images of carbon-copy documents, documents printed on low-quality paper, and documents that are n-th ge...
Degraded Document Analysis and Extraction of Original Text Document: An Approach without Optical Character Recognition
Document Image Analysis recognizes text and graphics in documents acquired as images. An approach without Optical Character Recognition (OCR) for degraded document image analysis has been adopted in this paper. The technique involves document imaging methods such as Image Fusing and Speeded Up Robust Features (SURF) Detection to identify and extract the degraded regions from a set of document i...
Algorithms for postprocessing OCR results with visual inter-word constraints
Algorithms are presented that determine the visual relationships between word images in a document. These include instances of common word images and common substrings that occur often in English language text images. This information is then used to improve the performance of a commercial optical character recognition (OCR) algorithm. The algorithms presented here calculate clusters of equi...
OCR of Degraded Documents using HMM-Based Techniques
We present an OCR system for handling degraded documents, such as faxed text. The basic system utilizes the BBN BYBLOS OCR system, which uses a Hidden Markov Model (HMM) approach for training and recognition. To handle degraded documents, we present two approaches, which can be applied individually or jointly. In the first approach, we train the system on documents that exhibit the expected kin...
Extraction of Original Text Document from a Set of Degraded Text Documents from the Same Source
Information extraction is the task of extracting structured data from a degraded document. It includes extraction of data such as text, images, or graphics from sources such as an image, video, or documents. Text detection and extraction from degraded documents finds application in a wide range of studies. In this paper, an Optical Character Recognition less (OCR-less) method of obtaining an origi...
Publication date: 1995